Phase-corrected RASTA for automatic speech recognition over the phone

Authors

  • Johan de Veth
  • Lou Boves
Abstract

In this paper we propose an extension to the classical RASTA technique. The new method consists of classical RASTA filtering followed by a phase correction operation. In this manner, the influence of the communication channel is as effectively removed as with classical RASTA. However, our proposal does not introduce a left-context dependency like classical RASTA. Therefore the new method is better suited for automatic speech recognition based on context-independent modeling with Gaussian mixture hidden Markov models. We tested this in the context of connected digit recognition over the phone. When we used context-dependent hidden Markov models (i.e. word models), we found that classical RASTA and phase-corrected RASTA performed equally well. For context-independent phone-based models, we found that phase-corrected RASTA can outperform classical RASTA depending on the acoustic resolution of the models.

1. INTRODUCTION

For automatic speech recognition (ASR) over the telephone it is well known that the recognition performance may be seriously degraded due to the transfer characteristics of the handset microphone and the telephone channel [1]. In order to reduce the influence of the linear filtering effect of the communication channel, different channel normalisation (CN) techniques have been proposed (for example [2, 3, 4]). In our paper we present a new, extended version of the classical RASTA filtering technique [3].

Classical RASTA filtering features two important properties: (1) attenuation at low modulation frequencies and (2) enhancement of the dynamic parts of the spectrogram [3]. The first property explains why classical RASTA filtering is such an effective method for CN: in the cepstral or log-energy domain, linear filtering by a quasi-stationary communication channel gives rise to an additive constant bias term [1]. The attenuation at low modulation frequencies effectively removes this DC-component.
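The additive-bias argument can be spelled out in a few lines. The following is only a sketch: the symbols $x$, $h$, $y$, $c$ and the frame index $t$ are our own notation (not taken from the paper), and it assumes the channel impulse response is constant over the utterance and short compared with the analysis window.

$$
\begin{aligned}
y[n] &= (x * h)[n] \\
|Y_t(f)| &\approx |X_t(f)|\,|H(f)| \\
\log|Y_t(f)| &\approx \log|X_t(f)| + \log|H(f)| \\
c_y[t] &\approx c_x[t] + c_h
\end{aligned}
$$

Because the cepstral transform is linear, the channel term $c_h$ does not depend on $t$. In the time trajectory of each cepstral coefficient it therefore sits at modulation frequency zero, which is exactly where the RASTA magnitude response has its strongest attenuation.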
It has been suggested that the second property is also beneficial for good recognition performance [3]. Recently, it was shown that the enhancement of the dynamic parts of the spectrogram obtained by classical RASTA represents a crude approximation of the effects of temporal forward masking in human auditory perception [5, 6]. Thus, classical RASTA may be viewed as a combination of CN and a crude model of human auditory time-masking.

The method we propose consists of classical RASTA filtering followed by a phase correction operation. The phase correction is chosen such that the frequency-dependent non-linear phase shift of the classical RASTA filter is compensated, while at the same time preserving the original magnitude response of the classical RASTA filter [7]. In this manner phase-corrected RASTA effectively removes the influence of the communication channel and at the same time does not enhance the dynamic parts of the spectrogram (i.e. does not model human auditory time-masking). In addition, phase-corrected RASTA removes the well-known left-context dependency introduced by classical RASTA. Therefore, one may expect that the new CN method is better suited for ASR based on context-independent (CI) modeling.

This paper is organised as follows. In section 2 we describe details of the phase-corrected RASTA method. We will focus on the non-linear phase distortion introduced by classical RASTA and describe the method we used to restore the original phase. Next, in section 3, the signal processing for our experiments is described. The telephone database that we used for our experiments is discussed in section 4. After this, the topology of the hidden Markov models (HMMs), the way we performed training with cross-validation and the recognition syntax during testing are described in section 5. The results of our recognition experiments are discussed in section 6.
As we will see, these experiments show that removal of the phase distortion of the RASTA filter leads to a significant increase of recognition performance when using CI HMMs. Finally, in section 7 we sum up the main conclusions.

2. PHASE-CORRECTED RASTA

Consider the signal shown in the upper panel of Figure 1 (we took a synthetic signal instead of a real MFCC coordinate time series for didactic purposes). The signal is a sequence of seven stationary segments ("speech states") preceded and followed by a rest state ("silence"). Notice that the signal contains a constant overall DC-component (representing the effect of the communication channel). The RASTA filtered version of this signal is shown in the middle panel of Figure 1. Two important observations can be made. First, the DC-component has been effectively removed (at least for times larger than, say, 70 frames). Second, the shape of the signal has been altered.

In Proc. ICASSP-97, Apr 21-24, Munich, Germany, pp. 1239-1242, 1997

With regard to the shape distortion the following can be noticed. First, the seven speech states of the signal that had a constant amplitude are now no longer stationary. Instead, the amplitude for each state shows a tendency to drift towards zero. Thus, RASTA filtering steadily decreases the value of cepstral coefficients in stationary parts of the speech signal, while the values immediately after an abrupt change are preserved. This explains the observation that the dynamic parts in the spectrogram of a speech signal are enhanced by RASTA filtering [3]. As a consequence of this drift, however, a description of the signal in terms of stationary states with well-located means and small variances becomes less accurate. Second, the mean amplitude of each state has become a function of the state itself as well as the amplitudes of states immediately preceding it.
This is the well-known left-context dependency introduced by the RASTA filter [3]. Because the absolute ordering of signal amplitudes is lost, states can no longer be straightforwardly characterised by their mean amplitude (compare speech states two, four and seven before and after RASTA filtering in the upper and middle panel of Figure 1). For this reason, RASTA is less well suited when using CI models (cf. the remarks in [3]). Finally, we mention a third aspect of the shape distortion for completeness (which we feel is less important though). Due to the small attenuation of high-frequency components, abrupt amplitude changes are smoothed.

[Figure 1: three panels; horizontal axis: time (frames), 0 to 400.]
Figure 1: Synthetic signal representing one of the cepstral coefficients in the feature vector. Upper panel: Original signal containing a time-invariant DC-offset. Middle panel: RASTA filtered signal. Lower panel: Phase-corrected RASTA filtered signal.

The complex frequency response of the classical RASTA filter $H_R(\omega)$ may be written as

$$H_R(\omega) = |H_R(\omega)|\, e^{j\phi_R(\omega)} \qquad (1)$$

with $\omega$ the modulation frequency (in radians), $|H_R(\omega)|$ the RASTA magnitude response and $\phi_R(\omega)$ the RASTA phase response. The log-magnitude and phase response of the classical RASTA filter with integration factor $a = -0.94$ are shown in Figures 2a,b for modulation frequencies in the range 0-20 Hz. This range includes the 2-16 Hz region, which has been shown to be most important for good recognition by humans [8]. From Figure 2b, it can be seen that the phase response is non-linear for modulation frequencies below approximately 3 Hz. As we will see, the non-linear phase response of the classical RASTA filter is the main cause of the shape distortions observed in the middle panel of Figure 1.
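The magnitude and phase responses discussed above can be evaluated numerically. The sketch below is illustrative only and rests on two assumptions that the text does not state: the numerator coefficients are the commonly cited RASTA ones (the text gives only the integration factor $a = -0.94$), and the frame rate is 100 frames per second, which is needed to convert Hz to radians per frame. The names `rasta_response`, `B`, `A` and `FRAME_RATE_HZ` are ours.

```python
import numpy as np

# Assumed coefficients: commonly cited RASTA numerator, denominator 1 + a*z^-1
# with a = -0.94 (only the value of a is given in the text).
B = 0.1 * np.array([2.0, 1.0, 0.0, -1.0, -2.0])
A = np.array([1.0, -0.94])
FRAME_RATE_HZ = 100.0  # assumed frame rate (10 ms frame shift)


def rasta_response(omega):
    """Complex response H_R(omega) = B(e^{j omega}) / A(e^{j omega}).

    omega is the modulation frequency in radians per frame.
    """
    z_inv = np.exp(-1j * np.asarray(omega))
    num = sum(b * z_inv**k for k, b in enumerate(B))
    den = sum(a * z_inv**k for k, a in enumerate(A))
    return num / den


# Modulation frequencies from 0 to 20 Hz in steps of 0.1 Hz.
freqs_hz = np.linspace(0.0, 20.0, 201)
omega = 2.0 * np.pi * freqs_hz / FRAME_RATE_HZ
H = rasta_response(omega)

magnitude = np.abs(H)
log_magnitude_db = 20.0 * np.log10(np.maximum(magnitude, 1e-12))

# The phase is undefined at DC, where the response is zero, so skip that bin.
phase = np.unwrap(np.angle(H[1:]))

# Group delay in frames: constant for a linear-phase filter, so any variation
# with frequency indicates a non-linear phase response.
group_delay = -np.gradient(phase, omega[1:])
```

Under these assumptions the response is zero at DC, and the group delay is much larger well below 3 Hz than at 10 Hz. This is in line with the statement that the phase response is non-linear at low modulation frequencies.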
In order to compensate the phase distortion of the RASTA filter, while at the same time preserving the original magnitude response, we followed the procedure suggested in [9]. After the classical RASTA filter, an all-pass filter can be applied such that its phase response $\phi_{pc}(\omega)$ is exactly the opposite of the phase response of the RASTA filter
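The idea of an all-pass stage whose phase is the opposite of the RASTA phase can be illustrated in the frequency domain. The sketch below is not the procedure of [9], whose details are not given here. It is a whole-utterance FFT realisation, applied circularly, which is an assumption on our part. The RASTA numerator coefficients are the commonly cited ones, and the test signal (state amplitudes, segment lengths, DC offset) is a made-up stand-in for the synthetic signal of Figure 1. All function names are ours.

```python
import numpy as np

# Assumed RASTA coefficients (commonly cited form); the text only gives a = -0.94.
B = 0.1 * np.array([2.0, 1.0, 0.0, -1.0, -2.0])
A = np.array([1.0, -0.94])


def rasta_response(omega):
    """Complex response H_R(omega) for omega in radians per frame."""
    z_inv = np.exp(-1j * np.asarray(omega))
    num = sum(b * z_inv**k for k, b in enumerate(B))
    den = sum(a * z_inv**k for k, a in enumerate(A))
    return num / den


def make_test_signal(dc_offset=3.0):
    """Hypothetical signal: rest, seven stationary states, rest, plus DC."""
    levels = [1.0, 0.4, 1.5, 0.7, 2.0, 1.2, 0.6]  # made-up state amplitudes
    parts = [np.zeros(50)] + [np.full(40, v) for v in levels] + [np.zeros(70)]
    return np.concatenate(parts) + dc_offset


def classical_and_phase_corrected(x):
    """Return (classical RASTA output, phase-corrected output) for signal x."""
    n = len(x)
    omega = 2.0 * np.pi * np.fft.rfftfreq(n)  # radians per frame
    h = rasta_response(omega)
    spectrum = np.fft.rfft(x)

    # Classical RASTA, applied circularly over the whole utterance. The rest
    # states at both ends keep the wrap-around effect small.
    classical_spec = spectrum * h

    # All-pass stage: unit magnitude, phase opposite to the RASTA phase, so the
    # cascade has response |H_R(omega)| with zero phase.
    all_pass = np.exp(-1j * np.angle(h))
    corrected_spec = classical_spec * all_pass

    return np.fft.irfft(classical_spec, n), np.fft.irfft(corrected_spec, n)


x = make_test_signal()
classical, corrected = classical_and_phase_corrected(x)
```

Both outputs have the same magnitude spectrum and no DC-component; they differ only in phase. Because the cascade has zero phase, the phase-corrected output is at least as strongly correlated with the original signal as the classical RASTA output. A causal, frame-by-frame implementation would need a different realisation of the all-pass stage.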


Similar resources

Channel Normalisation Using Phase-Corrected RASTA

Recently, we proposed an extension to the classical RASTA technique. The new method consists of classical RASTA filtering followed by a phase correction operation. In this manner, the influence of the communication channel is as effectively removed as with classical RASTA. However, our proposal does not introduce a left-context dependency like classical RASTA. Therefore the new method is better sui...


Continuous Speech Recognition with Phase-corrected Rasta

Phase-corrected RASTA is a new technique for channel normalisation that consists of classical RASTA filtering followed by a phase correction operation. In this manner, the channel bias is as effectively removed as with classical RASTA, without introducing a left-context dependency. The performance of the phase-corrected RASTA channel normalization technique was evaluated for a continuous speech...


Effectiveness of phase-corrected RASTA for continuous speech recognition

Phase-corrected RASTA is a new technique for channel normalization that consists of classical RASTA filtering followed by a phase correction operation. In this manner, the channel bias is as effectively removed as with classical RASTA, without introducing a left context dependency. The performance of the phase-corrected RASTA channel normalization technique was evaluated for a continuous ...


Comparison of channel normalisation techniques for automatic speech recognition over the phone

We compared three different channel normalisation (CN) methods in the context of a connected digit recognition task over the phone: cepstrum mean subtraction (CMS), RASTA filtering and the Gaussian dynamic cepstrum representation (GDCR). Using a small set of context-independent (CI) continuous Gaussian mixture hidden Markov models (HMMs) we found that CMS and RASTA outperformed the GDCR techniqu...


Speech Emotion Recognition Based on Power Normalized Cepstral Coefficients in Noisy Conditions

Automatic recognition of speech emotional states in noisy conditions has become an important research topic in the emotional speech recognition area, in recent years. This paper considers the recognition of emotional states via speech in real environments. For this task, we employ the power normalized cepstral coefficients (PNCC) in a speech emotion recognition system. We investigate its perfor...




Publication date: 1997